Conversation

@vchuravy
Member

Trying to fix #41 (comment)

@vchuravy
Member Author

This alone is not sufficient to resolve the data-race.

@vchuravy vchuravy requested a review from anicusan June 25, 2025 07:26
@vchuravy vchuravy marked this pull request as ready for review June 25, 2025 07:26
@vchuravy
Member Author

@anicusan I currently don't have an MWE that triggers anymore, but I believe this is closer to the spirit of what you intended to implement?

@giordano

This fixes #46 on AMDGPU MI300 as per #46 (comment)

@vchuravy vchuravy force-pushed the vc/unsafe_atomics branch from c43a4a3 to d4698ab July 1, 2025 06:53
@vchuravy vchuravy changed the base branch from main to vc/accumulate_alg July 1, 2025 06:53
@vchuravy vchuravy changed the title from "attempt to use UnsafeAtomics to fix race in accumulate" to "Use UnsafeAtomics to fix race in accumulate" Jul 1, 2025
@anicusan
Member

anicusan commented Jul 4, 2025

Apologies for the radio silence, I've been away for a conference. This is extremely useful, thank you for digging into this @vchuravy and @giordano. Two questions from my side:

  • Is UnsafeAtomics stable / will it be supported in KA in the future?
  • Is there any reason you hard-coded UInt8 as the flag type? It is the smallest and simplest, but if a user wants to supply some temporary buffer they have lying around, it could work with any integer, right?

@vchuravy
Member Author

vchuravy commented Jul 8, 2025

UnsafeAtomics is stable and will be supported by JuliaGPU (it's the underpinning of Atomix).

Is there any reason you hard-coded UInt8 as the flag type? It is the smallest and simplest, but if a user wants to supply some temporary buffer they have lying around, it could work with any integer, right?

Yeah, store! doesn't convert the flag to the eltype of the pointer. But I added that conversion into the code here.
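To make that concrete, here is a minimal host-side sketch of the idea (the helper names set_flag! and load_flag are illustrative, not identifiers from this PR): convert the flag value to the element type of the buffer before the atomic store, so any integer-typed flag buffer works, not just UInt8.

using UnsafeAtomics

# Illustrative helpers, not the PR's actual code: publish/read a per-block flag
# through a raw pointer, converting the value to the buffer's element type first.
function set_flag!(flags::Vector{T}, i::Integer, value) where {T<:Integer}
    GC.@preserve flags begin
        # release ordering: writes made before this store become visible to any
        # thread that later observes the flag with an acquire load
        UnsafeAtomics.store!(pointer(flags, i), convert(T, value), UnsafeAtomics.release)
    end
end

function load_flag(flags::Vector{T}, i::Integer) where {T<:Integer}
    GC.@preserve flags begin
        UnsafeAtomics.load(pointer(flags, i), UnsafeAtomics.acquire)
    end
end

# Works the same whether the scratch buffer is UInt8, UInt32, Int64, ...
flags = zeros(UInt32, 4)
set_flag!(flags, 1, 0x01)
load_flag(flags, 1) == 1   # true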

@vchuravy vchuravy requested a review from anicusan July 8, 2025 09:51
@christiangnrd
Member

christiangnrd commented Jul 8, 2025

Might be worth a rebase since there have been changes to the accumulate tests pushed to main.

Metal compilation fails because it only supports 32-bit integers (and floats) for these atomic operations. However, the DecoupledLookback tests failed when I changed the flag to use UInt32, so maybe we should reinstate ScanPrefixes as the default for that platform?

Finally, I ran this on a 3080 and got 1 failed test (test/accumulate.jl line 47, or 51 after rebase). I couldn't reproduce it until I increased the number of iterations for the "small block sizes -> many blocks" test to 10000, after which I get <10 failures.

@vchuravy
Member Author

vchuravy commented Jul 8, 2025

I would really prefer not to undo the Metal change without understanding why. The memory semantics ought to be the same, and as we have seen, more capable microarchitectures surface the latent bugs in this algorithm.

You are seeing failures at large enough problem sizes on CUDA as well?

@christiangnrd
Member

You are seeing failures at large enough problem sizes on CUDA as well?

I am. Also reproduces on the un-rebased version of this PR. Either more iterations or bigger arrays will increase the odds of triggering the bug.

@anicusan
Member

The Metal backend will not be able to support the DecoupledLookback algorithm - that was the primary reason for developing ScanPrefixes (issue / PR).

I realize I made a mistake when reading the Metal docs and unintentionally assumed more functionality than is guaranteed. threadgroup_barrier in MSL is effectively OpenCL C 1.2's barrier intrinsic, which only has a memory scope of Workgroup. This means (in the context of gpuweb/gpuweb#2229) that inter-workgroup communication cannot be supported correctly on all underlying platforms. (ref)

The decoupled lookback assumes that changes made to a device array (flags and v) by one workgroup will be visible to other workgroups. From the above discussions it seems that Metal does not offer such barriers at all.
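For context, here is a heavily simplified sketch of that flag/value handshake (illustrative Julia code mimicking the device-side protocol on the host; publish!, lookback, and the array names are hypothetical, not the kernel in this PR). A producer block writes its partial sum and only then sets its flag with release ordering; a consumer block spins on the flag with acquire ordering before reading the partial. This handshake is exactly the step that requires device-scope visibility between workgroups.

using UnsafeAtomics

const FLAG_READY = UInt8(1)

# Producer block: write the message (partial sum), then publish the flag with release ordering.
function publish!(partials::Vector{Float32}, flags::Vector{UInt8}, block::Int, value::Float32)
    partials[block] = value
    GC.@preserve flags begin
        UnsafeAtomics.store!(pointer(flags, block), FLAG_READY, UnsafeAtomics.release)
    end
    return nothing
end

# Consumer block: spin on the flag with acquire ordering; once it is observed,
# the partial sum written before the matching release store must be visible.
function lookback(partials::Vector{Float32}, flags::Vector{UInt8}, block::Int)
    GC.@preserve flags begin
        while UnsafeAtomics.load(pointer(flags, block), UnsafeAtomics.acquire) != FLAG_READY
        end
    end
    return partials[block]
end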

We have large-scale accumulate tests though - at least enough to saturate the SMs - and I have not been able to reproduce accumulate bugs on CUDA / AMD; @christiangnrd do you have an MWE?

@vchuravy on oneAPI we seem to have the following atomics error:

error: undefined reference to `_Z18__spirv_AtomicLoadPU3AS1cii'
in function: '__spirv_AtomicLoad(char AS1*, int, int)' called by kernel: 'gpu__accumulate_previous_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64> > >, NDRange<1, DynamicSize, StaticSize<_256__>, CartesianIndices<1, Tuple<OneTo<Int64> > >, void> >, _, oneDeviceArray<Int32, 1, 1>, _<UInt8, 1, 1>, Int32)'

Any ideas?

@vchuravy
Member Author

We widened the semantics on Metal recently to match the semantics across all backends. JuliaGPU/Metal.jl#609

So either this algorithm is sound everywhere or it is sound nowhere.

For oneAPI, @maleadt do you know off the top of your head which atomics are legal?

@anicusan
Member

Are you sure the Metal Shading Language supports that? See this post:

Decoupled look-back cannot run on Metal either, which is the underlying reason for the above. Also surprising, because I had imagined otherwise from reading the Metal Shading Language Specification. I believe that language will also be clarified.

@anicusan
Member

Further to the above, from the gpuweb/gpuweb#2297 discussion (where they had to change their scan because of Metal alone):

Metal's mem_fence_flags only affects the storage aspect of memory semantics, not the memory scope

it means Metal's threadgroup_barrier doesn't reliably support message passing between one threadgroup (workgroup) and another via device memory

about what's needed for message passing across workgroups (where the message can't be packed in a 32 bit word along with the flags), no, from what I can tell it can't be done in Metal at all (confirmed by testing and discussions with experts)

Also, in the MSL v4 Spec, Table 6.13. "Memory flag enumeration values for barrier functions" shows (bold mine):

mem_device | The flag ensures the GPU correctly orders the memory operations to device memory for threads in the threadgroup or simdgroup.

So your addition in JuliaGPU/Metal.jl#609 was correct anyway, but device memory writes are still only guaranteed to be visible within workgroups, not across them.

I can't really reduce the problem to a smaller MWE than the _accumulate_previous! function - decoupled lookback really is the MWE for Apple Metal not allowing message passing across workgroups (as shown in the post and GPUWeb discussion above).


What is left to do for this PR:

  • Add back the extension with alg=ScanPrefixes() for Metal only.
  • Make the atomic load/stores work on oneAPI.

@anicusan
Member

Related, @christiangnrd what did your benchmarks show regarding DecoupledLookback vs ScanPrefixes? The former has lower asymptotic complexity (for many blocks, so should be more scalable) and only two kernel launches, while the latter has three kernel launches but possibly a lower time constant. On a small benchmark with 1 million Float32 on Google Colab I see:

CUDA.jl accumulate:     207.351 μs
AK DecoupledLookback:   182.458 μs
AK ScanPrefixes:        101.358 μs
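For reference, roughly how such a comparison can be run with BenchmarkTools (a hedged sketch: the alg and init keywords on AK.accumulate! are assumptions taken from this thread, not a checked API):

using BenchmarkTools, CUDA
import AcceleratedKernels as AK

# Helper functions so each timed call synchronizes the GPU before returning.
run_base(x) = CUDA.@sync accumulate(+, x)
# `alg=` and `init=` keywords on AK.accumulate! are assumed from this thread.
run_ak!(y, alg) = CUDA.@sync AK.accumulate!(+, y; init=0.0f0, alg)

x = CUDA.rand(Float32, 1_000_000)

@btime run_base($x)                                              # CUDA.jl baseline
# accumulate! mutates its argument, so re-copy x before every evaluation
@btime run_ak!(y, AK.DecoupledLookback()) setup=(y = copy($x)) evals=1
@btime run_ak!(y, AK.ScanPrefixes()) setup=(y = copy($x)) evals=1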

@christiangnrd
Member

MWE:

using Test, CUDA
import AcceleratedKernels as AK

for i in 1:10000
    num_elems = rand(1:1_000_000)
    x = CuArray(rand(Int32(1):Int32(1000), num_elems))
    y = copy(x)
    AK.accumulate!(+, y; init=0)

    res = Array(y) .== accumulate(+, Array(x))

    passed = all(res)
    if !passed
        @info sum(.!res)
        # @info i,collect(1:num_elems)[.!res]
    end
    @test passed
end

@christiangnrd
Member

what did your benchmarks show regarding DecoupledLookback vs ScanPrefixes?

Algorithm              Float32 512k   Float32 3M    Int64 512k   Int64 3M
CUDA.jl accumulate     98.160 μs      421.993 μs    106.651 μs   447.884 μs
AK DecoupledLookback   89.881 μs      453.07 μs     119.161 μs   600.878 μs
AK ScanPrefixes        53.120 μs      264.760 μs    94.610 μs    454.240 μs

So the results are closer for Int64 than for Float32, but DecoupledLookback ranges from slightly better to much worse than the other two.

Run on a 3060.

@anicusan
Member

It seems ScanPrefixes is almost always faster than DecoupledLookback and the CUDA.jl base implementation - and it doesn't depend on fickle cross-workgroup message passing. I will merge this into vc/accumulate_alg, then keep the switch to ScanPrefixes for all platforms, and remove ext/Metal.

@anicusan anicusan merged commit 1b17354 into vc/accumulate_alg Jul 21, 2025
37 of 38 checks passed